Summary of Analysis

In this analysis, we investigate the impact of removing specific cell types from the reference dataset and mapping/querying them using the modified reference. The main steps include:

Cell Type Removal: CD4, CD8, CD14, and NK cell types are systematically removed from the reference dataset.

Mapping and Querying: Utilizing the modified reference, we map the query dataset.

Principal Component Regression: Regression of principal components (PCs) against predicted cell types is performed. We visualize PC scatter plots and box plots.

Distance Calculation: Calculation of Euclidean distances between the cell types in the reference and the corresponding cell types in the query.

Inter-Cell Type Distances: Computation of distances between the cell types removed from the reference and the cell types to which they were assigned in the query.

Introduction

Welcome to this document, where we delve into the results of a regression analysis involving Principal Components (PCs) and cell labels. Our exploration encompasses two distinct scenarios:

Regression of PCs against Observed Cell Labels in Query Dataset: This scenario involves the regression of Principal Components against the actual cell labels within the query dataset.

Regression of PCs against Predicted Cell Labels in Query Dataset: In this scenario, we perform regression of Principal Components against predicted cell labels in the query dataset. These predicted labels are derived from systematically excluding one cell at a time from the reference dataset. The modified reference is then employed to map the query dataset.

The analysis hinges on a Single-Cell RNA sequencing (scRNA-seq) 10x Genomics PBMC dataset. This dataset is divided into two subsets: the reference set (10x Single-Cell RNAseq PBMCs - Reference) and the query dataset (10x Single-Cell RNAseq PBMCs - Withheld Test Set). The dataset can be accessed from the following link: https://figshare.com/projects/Curated_10X_Data/137892

library(SingleR)
library(scater)
library(Polychrome)
library(plotly)
library(ggplot2)

Data Processing

In this section, we load the reference and withheld datasets, log-transform the data, and explore the impact of removing different cell types from the reference dataset on prediction accuracy.

PCA of reference

Analysis - Removing Cell Types

Here, we analyze the effect of removing individual cell types from the reference dataset on prediction accuracy using the SingleR package. We have a reference dataset containing CD4, CD8, CD14, and NK cell types. We remove CD4 cells from the reference and map query using this reference. Similarly, this process is performed for CD8, CD14, and NK cells individually, each time removing one cell type and mapping query with this modified reference.

# Removal of one cell type at a time from reference
tab <- lapply(unique(reference$label), function(i) {
  indx <- which(reference$label == i)
  refn <- reference[, -indx]
  annotations <- SingleR(query, ref = refn, labels = refn$label)
  table_result <- as.data.frame(table(annotations$labels, query$label))
  return(list(annotations = annotations, table_result = table_result))
})

df <- mapply(function(tbl_result, label) {
  indx <- which(tbl_result$Var2 == label)
  tbl_result[indx, ]
}, lapply(tab, `[[`, "table_result"), unique(reference$label), SIMPLIFY = FALSE)

df = do.call(rbind, df)
df <- subset(df, Freq > 0)

# Concordance plot
df$Var2 <- factor(df$Var2, levels = c("CD8", "NK", "CD4", "CD14"))
ggplot(df, aes(x = Var1, y = Var2, size = Freq)) + 
  geom_point() + 
  xlab("Predicted labels") + 
  ylab("Removed labels") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The scatter plot above displays removed cell types from reference on y axis and predicted cell type on x axis. For instance, if from reference dataset CD8 cells are removed, It will be predicted as CD4 cells in a query dataset.

PCA and Regression analysis

In this section, we apply PCA to the query dataset and perform regression analysis to understand the relationship between principal components and observed cell labels from query dataset.

query <- runPCA(query)

# Function to regress PCs against cell labels
regress_PCs <- function(data) {
  
  # Calculate R-squared values
  lm_models <- lapply(1:10, function(i) {
    lm_model <- lm(paste0("PC", i, " ~ CellTypes"), data = data)
    return(lm_model)
  })
  
  # Calculate R-squared values
  rsquared <- sapply(lm_models, function(model) {
    rsq <- summary(model)$r.squared
    return(rsq)
  })
  
  rsquared_df <- data.frame(PC = 1:10, R2 = rsquared)
  return(rsquared_df)
}

# Function to create scatter plot and convert to plotly
scatter_to_plotly <- function(data, x_col, y_col, color_col, title) {
  p <- ggplot(data, aes_string(x = x_col, y = y_col, color = color_col)) +
    geom_point(size = 0.5) +
    scale_color_manual(values = color_palette, guide = guide_legend(override.aes = list(size = 2))) +
    labs(x = x_col, y = y_col, title = title) +
    theme_bw()
  
  return(ggplotly(p))
}

calculatePairwiseDistancesAndPlotDensity <- function(query_data, ref_data, query_cell_type_col, ref_cell_type_col, cell_type_query, cell_type_reference,distance_metric, correlation_method = "pearson") {
  # Subset query and reference data to the specified cell type
  query_data_subset <- query_data[, !is.na(query_data[[query_cell_type_col]]) & query_data[[query_cell_type_col]] == cell_type_query]
  ref_data_subset <- ref_data[, !is.na(ref_data[[ref_cell_type_col]]) & ref_data[[ref_cell_type_col]] == cell_type_reference]
  
  # Convert to matrix
  query_mat <- t(as.matrix(assay(query_data_subset, "logcounts")))
  ref_mat <- t(as.matrix(assay(ref_data_subset, "logcounts")))
  
  # Combine query and reference matrices
  combined_mat <- rbind(query_mat, ref_mat)
  
  # Calculate pairwise distances or correlations for all comparisons
  if (distance_metric == "correlation") {
    if (correlation_method == "pearson") {
      dist_matrix <- cor(t(combined_mat), method = "pearson")
    } else if (correlation_method == "spearman") {
      dist_matrix <- cor(t(combined_mat), method = "spearman")
    } else {
      stop("Invalid correlation method. Available options: 'pearson', 'spearman'")
    }
  } else {
    dist_matrix <- dist(combined_mat, method = distance_metric)
  }
  
  # Convert dist_matrix to a square matrix
  dist_matrix <- as.matrix(dist_matrix)
  
  # Extract the distances or correlations for the different pairwise comparisons
  num_query_cells <- nrow(query_mat)
  num_ref_cells <- nrow(ref_mat)
  dist_query_query <- dist_matrix[1:num_query_cells, 1:num_query_cells]
  dist_ref_ref <- dist_matrix[(num_query_cells+1):(num_query_cells+num_ref_cells), (num_query_cells+1):(num_query_cells+num_ref_cells)]
  dist_query_ref <- dist_matrix[1:num_query_cells, (num_query_cells+1):(num_query_cells+num_ref_cells)]
  
  # Create data frame for plotting
  dist_df <- data.frame(
    Comparison = c(rep("Query vs Query", length(dist_query_query)),
                   rep("Reference vs Reference", length(dist_ref_ref)),
                   rep("Query vs Reference", length(dist_query_ref))),
    Distance = c(as.vector(dist_query_query),
                 as.vector(dist_ref_ref),
                 as.vector(dist_query_ref))
  )
  
  # Plot density plots
  ggplot(dist_df, aes(x = Distance, color = Comparison)) +
    geom_density() +
    labs(x = ifelse(distance_metric == "correlation", paste(correlation_method, "correlation"), "Distance"), y = "Density", title = "Pairwise Distance Analysis and Density Visualization") +
    theme_bw()
}

Visualizing Regression Results

Note: Here original/observed query data cell labels are used.

We visualize the outcome of the regression analysis by creating line plots that depict the R-squared values for each principal component.

PC1: The high R-squared value indicates that a significant portion of the variability in PC1 is explained by the cell labels. This suggests that PC1 captures meaningful differences between the cell types.

PC2: Captures a substantial amount of variation related to cell labels. This component contributes significantly to distinguishing between different cell types.

PC3: The R-squared value of PC3 indicates that it still captures a moderate amount of variation explained by cell labels, although it is less influential than PC1 and PC2.

PC4 to PC10: The R-squared values for PC4 to PC10 are relatively lower, ranging from 0.0366 to 0.0073. These components capture much less of the variation explained by the cell labels. Their low R-squared values might suggest that these principal components might be influenced by other factors not accounted for by the provided cell labels

# Data preparation
query_observed <- data.frame(reducedDim(query, "PCA"), CellTypes = query$label)
rsquared_df <- regress_PCs(query_observed)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)

Scatter plots between different PCs for query dataset (Observed query dataset)

# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query_observed$CellType)), seeds)
names(color_palette) <- unique(query_observed$CellType)

PC1 vs PC2 scatter plot

The scatter plot of PC1 vs PC2 indicates that there are separate clusters for NK cells and CD14 cells, which means that these two cell types are well-separated from the others in this two-dimensional space defined by PC1 and PC2. This separation suggests that PC1 and PC2 capture distinct variations that allow for distinguishing between NK cells and CD14 cells from other cell types.

scatter_to_plotly(query_observed, "PC1", "PC2", "CellTypes", "PC1 vs PC2")

PC1 vs PC3 scatter plot

The variation captured by PC3 contributes to the separation of CD14 cells from other cell types when looking at the plot with PC3.

scatter_to_plotly(query_observed, "PC3", "PC1", "CellTypes", "PC1 vs PC3")

PC1 vs PC4 scatter plot

Similar to above results, the variation captured by PC4 contributes to the separation of CD14 cells from other cell types.

scatter_to_plotly(query_observed, "PC4", "PC1", "CellTypes", "PC1 vs PC4")

PC4 vs PC5 scatter plot

Plot of PC4 vs PC5 shows a mixed pattern with no clear separation or clustering of cell types, suggesting that the variation captured by these PCs may not have a strong discriminatory effect on the different cell types in the dataset.

scatter_to_plotly(query_observed, "PC4", "PC5", "CellTypes", "PC4 vs PC5")

Boxplots of PC1 vs Cell types

CD14 cells might exhibit distinct molecular characteristics that set them apart from the other three cell types.

ggplot(query_observed, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Boxplots of PC2 vs Cell types

These results indicate that NK cells might exhibit distinct gene expression patterns or molecular characteristics captured by PC2.

ggplot(query_observed, aes(x = CellTypes, y = PC2, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Boxplots of PC3 vs Cell types

ggplot(query_observed, aes(x = CellTypes, y = PC3, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC3") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Query dataset when CD4 cell type is removed from reference dataset

In this section, we focus on the query dataset that was generated after removing CD4 cell types from the reference dataset. We perform analysis on this modified dataset to explore how the Principal Components (PCs) relate to the predicted cell labels.

PC1 captures a high amount of variance, suggesting that it explains a substantial portion of the variability in the data.

PC2 also captures a significant amount of variance, indicating its importance in explaining variability.

The R-squared values for PC3 to PC10 are much lower, indicating that these components capture relatively less variance in the data.

# Data preparation
query1 <- data.frame(reducedDim(query, "PCA"), CellTypes = tab[[1]]$annotations$labels)
rsquared_df <- regress_PCs(query1)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components (CD4 cells removed from reference dataset)") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)

Scatter plots between different PCs for query dataset (from reference CD4 cells removed)

We create scatter plots to visualize the PCA results for the query dataset with CD4 cell types removed from the reference.

# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query1$CellType)), seeds)
names(color_palette) <- unique(query1$CellType)

PC1 vs PC2 scatter plot

The scatter plot of PC1 vs PC2 for the query dataset with CD4 cell types removed from the reference demonstrates a distinct separation of clusters. This separation suggests that the variations captured by PC1 and PC2 are effective at distinguishing different cell types in the query dataset.

scatter_to_plotly(query1, "PC1", "PC2", "CellTypes", "PC1 vs PC2")

PC1 vs PC3 scatter plot

The separation of CD14 cells from the others along the PC3 axis is notable, indicating that the variation captured by PC3 contributes significantly to distinguishing CD14 cells from the rest. Additionally, the proximity of CD8 and NK cells along the PC3 axis suggests that these cell types may share some common gene expression patterns or molecular characteristics captured by PC3.

Also, concordance plot above revealed that when CD8 cells were removed from the reference dataset, a small subset of cells in the query dataset were being assigned as NK cells.

scatter_to_plotly(query1, "PC3", "PC1", "CellTypes", "PC1 vs PC3")

PC2 vs PC3 scatter plot

Separation of NK cells from other cell types seems to be influenced by both PC2 and PC3,

scatter_to_plotly(query1, "PC3", "PC2", "CellTypes", "PC3 vs PC2")

Boxplots of PC1 vs Cell types

ggplot(query1, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Boxplots of PC2 vs Cell types

ggplot(query1, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Boxplots of PC3 vs Cell types

ggplot(query1, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC3") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Query dataset when CD14 cell type is removed from reference dataset

In this section, we focus on the query dataset that was generated after removing CD14 cell types from the reference dataset. We perform analysis on this modified dataset to explore how the Principal Components (PCs) derived from PCA relate to the predicted cell labels.

PC1 explains approximately 11.06% of the total variance in the data. This value suggests that PC1 captures a moderate amount of variation in the dataset. While not extremely high, it still contributes significantly to the understanding of the differences among the data points.

PC2 explains approximately 86.10% of the total variance in the data. This high R-squared value indicates that PC2 captures a substantial amount of variation in the dataset. PC2 is a dominant component in explaining the differences among the data points.

PC3 explains approximately 44.17% of the total variance in the data. This value indicates that PC3 captures a moderate to high amount of variation in the dataset.

PC4 to PC10 each explain less than 10% of the total variance in the data. These components contribute relatively little to the overall variance and might capture noise or less significant patterns in the dataset.

# Data preparation
query2 <- data.frame(reducedDim(query, "PCA"), CellTypes = tab[[2]]$annotations$labels)
rsquared_df <- regress_PCs(query2)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components (CD14 cells removed from reference dataset)") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)

Scatter plots between different PCs for query dataset (from reference CD14 cells removed)

We create scatter plots to visualize the PCA results for the query dataset with CD14 cell types removed from the reference.

# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query2$CellType)), seeds)
names(color_palette) <- unique(query2$CellType)

PC1 vs PC2 scatter plot

In the scatter plot of PC1 vs PC2 for the query dataset (with reference CD14 cells removed), the following observations can be made:

Two Populations/Clusters of CD8 Cells: There are two distinct clusters or populations of CD8 cells visible on the plot. One of these clusters overlaps with CD4 cells, indicating some degree of similarity in their gene expression patterns.

NK cells are well-separated from the other cell types, suggesting that the variation captured by PC1 and PC2 is effective in distinguishing NK cells from the rest.

scatter_to_plotly(query2, "PC1", "PC2", "CellTypes", "PC1 vs PC2")

PC1 vs PC4 scatter plot

scatter_to_plotly(query2, "PC4", "PC1", "CellTypes", "PC1 vs PC4")

PC2 vs PC4 scatter plot

scatter_to_plotly(query2, "PC4", "PC2", "CellTypes", "PC4 vs PC2")

Boxplots of PC1 vs Cell types

The boxplots of PC1 values for different cell types reveal that the CD8 population exhibits a larger spread or variability along the PC1 axis compared to the CD4 and NK populations. This suggests that there might be greater heterogeneity or diversity within the CD8 cell population in terms of the gene expression patterns captured by PC1. On the other hand, the CD4 and NK populations seem to be more tightly clustered or have less variation along PC1.

ggplot(query2, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Boxplots of PC2 vs Cell types

PC2 contributes to the separation of NK cells from the other two cell types.

ggplot(query2, aes(x = CellTypes, y = PC2, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Query dataset when CD8 cell type is removed from reference dataset

In this section, we focus on the query dataset that was generated after removing CD8 cell types from the reference dataset. We perform analysis on this modified dataset to explore how the Principal Components (PCs) derived from PCA relate to the predicted cell labels.

From the plot, we can see that PC1 and PC2 capture the majority of the variance in the data, with PC1 explaining the most variance. The subsequent PCs (PC3 to PC10) capture less variance and may represent smaller or more subtle patterns in the data.

# Data preparation
query3 <- data.frame(reducedDim(query, "PCA"), CellTypes = tab[[3]]$annotations$labels)
rsquared_df <- regress_PCs(query3)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components (CD8 cells removed from reference dataset)") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)

Scatter plots between different PCs for query dataset (from reference CD8 cells removed)

We created scatter plots to visualize the PCA results for the query dataset with CD8 cell types removed from the reference.

# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query3$CellType)), seeds)
names(color_palette) <- unique(query3$CellType)

PC1 vs PC2 scatter plot

The scatter plot of PC1 against PC2 for the query dataset, with reference CD8 cells removed, shows clear separation between different cell types. This separation could indicate that the variability captured by PC1 and PC2 is significant enough to differentiate between the remaining cell types in your dataset.

scatter_to_plotly(query3, "PC1", "PC2", "CellTypes", "PC1 vs PC2")

PC1 vs PC3 scatter plot

CD14 cells are separated from the other cell types along the PC3 axis. This indicates that the variation captured by PC3 contributes significantly to the differences between CD14 cells and the rest.

scatter_to_plotly(query3, "PC3", "PC1", "CellTypes", "PC1 vs PC3")

PC2 vs PC3 scatter plot

scatter_to_plotly(query3, "PC3", "PC2", "CellTypes", "PC2 vs PC3")

Boxplots of PC1 vs Cell types

ggplot(query3, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Boxplots of PC2 vs Cell types

ggplot(query3, aes(x = CellTypes, y = PC2, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Boxplots of PC3 vs Cell types

ggplot(query3, aes(x = CellTypes, y = PC3, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC3") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Query dataset when NK cell type is removed from reference dataset

In this section, we focus on the query dataset that was generated after removing NK cell types from the reference dataset. We perform analysis on this modified dataset to explore how the Principal Components (PCs) derived from PCA relate to the predicted cell labels.

PC1 has an R-squared value of 0.949, indicating that it captures a substantial amount of variation in the dataset. It suggests that PC1 explains a large portion of the total variability.

PC2 has an R-squared value of 0.185, which is significantly lower than PC1. This means that PC2 explains a smaller portion of the total variability compared to PC1.

PC3 has an R-squared value of 0.396, indicating that it captures a moderate amount of variation. It’s higher than PC2 but not as high as PC1.

PCs 4 to 10 have much lower R-squared values ranging from 0.024 to 0.001, suggesting that they explain relatively small amounts of variation in the dataset.

# Data preparation
query4 <- data.frame(reducedDim(query, "PCA"), CellTypes = tab[[4]]$annotations$labels)
rsquared_df <- regress_PCs(query4)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components (NK cells removed from reference dataset)") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)

Scatter plots between different PCs for query dataset (from reference NK cells removed)

We create scatter plots to visualize the PCA results for the query dataset with NK cell types removed from the reference.

# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query4$CellType)), seeds)
names(color_palette) <- unique(query4$CellType)

PC1 vs PC2 scatter plot

CD14 cells are well separated from the other cell types along the PC1 axis.

There are two distinct populations of CD8 cells: one population is somewhat separate from the other, and the latter population has some overlap with CD4 cells.

scatter_to_plotly(query4, "PC1", "PC2", "CellTypes", "PC1 vs PC2")

PC1 vs PC3 scatter plot

CD14 cells continue to show a clear separation from the other cell types, indicating distinct differences in gene expression patterns or other molecular characteristics.

scatter_to_plotly(query4, "PC3", "PC1", "CellTypes", "PC1 vs PC3")

PC1 vs PC4 scatter plot

scatter_to_plotly(query4, "PC4", "PC1", "CellTypes", "PC1 vs PC4")

PC2 vs PC3 scatter plot

scatter_to_plotly(query4, "PC3", "PC2", "CellTypes", "PC3 vs PC2")

PC3 vs PC4 scatter plot

scatter_to_plotly(query4, "PC3", "PC4", "CellTypes", "PC3 vs PC4")

Boxplots of PC1 vs Cell types

PC1 captures significant variation that distinguishes CD14 cells from CD4 and CD8 cells.

ggplot(query4, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Boxplots of PC2 vs Cell types

The following analysis reveals:

CD8 Cell Population Spread: The noticeable spread in the CD8 cell population along PC2 indicates that there is significant variability in the gene expression or molecular characteristics of CD8 cells that is captured by this principal component. This spread might suggest that there are subgroups or subtypes within the CD8 cell population that exhibit distinct patterns along PC2.

Two CD8 Cell Populations: The presence of two distinct CD8 cell populations in the PC1 vs PC2 scatter plot aligns with the spread seen in the boxplots of PC2 vs Cell types. This suggests that the variation captured by PC2 contributes to the differentiation between these two CD8 cell populations. The spread seen in the boxplots might be attributed to the differences between these populations in terms of the captured variation along PC2.

ggplot(query4, aes(x = CellTypes, y = PC2, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Boxplots of PC3 vs Cell types

ggplot(query4, aes(x = CellTypes, y = PC3, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC3") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Distance analysis between distinct cell types of reference and query dataset

Note: Distances are being calculated for cell types in original/observed reference and query dataset.

Identified marker genes using trainSingleR() function from SingleR package and used these genes to compute distances between cells in reference and query dataset.

Distance between CD4 cells in reference and query

calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "label", "label", "CD4", "CD4", "euclidean")

Distance between CD14 cells in reference and query

calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "label", "label", "CD14", "CD14", "euclidean")

Distance between CD8 cells in reference and query

calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "label", "label", "CD8", "CD8", "euclidean")

Distance between NK cells in reference and query

calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "label", "label", "NK", "NK", "euclidean")

Distance analysis between cell types removed from reference and assigned cell type in query dataset

In this section, we calculate Euclidean distances between the removed cell types in the reference and their assigned cell types in the query.

# Extract labels from each element in the 'tab' list using lapply
label_lists <- lapply(tab, function(x) x$annotations$labels)
annotations <- do.call(data.frame, label_lists)
colnames(annotations) <- c("labels_CD4_removed", "labels_CD14_removed", "labels_CD8_removed", "labels_NK_removed")

Distance computation between CD4 cells in query and CD8 cells in reference

When, CD8 cells were removed from the reference and when this reference was used to annotate the query dataset, these cells were assigned as CD4 in query.

calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "labels_CD4_removed", "label", "CD8", "CD4", "euclidean")

Distance computation between CD8 cells in query and CD14 cells in reference

When, CD14 cells were removed from the reference and when this reference was used to annotate the query dataset, these cells were assigned as CD8 in query.

calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "labels_CD14_removed", "label", "CD8", "CD14", "euclidean")

Distance computation between CD4 cells in query and CD8 cells in reference

When, CD8 cells were removed from the reference and when this reference was used to annotate the query dataset, these cells were assigned as CD4 in query.

calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "labels_CD8_removed", "label", "CD4", "CD8", "euclidean")

Distance computation between CD8 cells in query and NK cells in reference

When, NK cells were removed from the reference and when this reference was used to annotate the query dataset, these cells were assigned as CD8 in query.

calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "labels_NK_removed", "label", "CD8", "NK", "euclidean")

Principal Component Analysis (PCA) Key Observations:

CD4 and CD8 Overlap in PC1 vs PC2 in the Query: The overlap observed between CD4 and CD8 cells in the PC1 vs PC2 plot of the query dataset suggests the presence of shared gene expression patterns between these two cell types. This shared pattern might lead to their proximity in the PCA space.

CD14, CD8, and NK Separation in PC1 vs PC2 After CD4 Removal: Upon removing CD4 cells from the reference dataset, there is an improved separation of CD14, CD8, and NK cells in the PC1 vs PC2 plot. This enhancement in separation could be attributed to the reduction of noise caused by CD4 cells, enabling a clearer distinction between these cell populations.

Two CD8 Populations in PC2 vs PC3 After CD4 Removal: The absence of CD4 cells from the reference dataset has uncovered previously masked heterogeneity within CD8 cells. This is evident from the emergence of two distinct CD8 subpopulations in the PC2 vs PC3 plot, indicating that CD4 cells were influencing the grouping of CD8 cells.

Two CD8 Populations in PC1 vs PC2 After CD14 Removal: Following the removal of CD14, there are notable changes in the variance captured by PC1 and PC2. This unmasking of variability has led to the identification of two distinct CD8 subpopulations in the PC1 vs PC2 plot, suggesting that CD14 cells were masking this heterogeneity.

Two CD4 Populations in PC2 vs PC3 After CD8 Removal: The absence of CD8 T cells in the reference dataset has revealed hidden variability within CD4 T cells. This is evident from the appearance of two distinct CD4 subpopulations in the PC2 vs PC3 plot, highlighting that the presence of CD8 cells was impacting the representation of CD4 cells.

Two CD8 Populations in PC1 vs PC2 After NK Removal: Removing NK cells from the reference dataset has had an influence on the positioning of CD8 T cells in the PCA space. This effect is seen in the emergence of two CD8 cell populations in the PC1 vs PC2 plot, indicating that the presence of NK cells was affecting the distribution of CD8 cells.

Distinct CD4, CD14, and CD8 Populations in PC1 vs PC3 After NK Removal: The removal of NK cells from the reference dataset has brought about clearer separation between CD4 T cells, CD14, and CD8 cells in the PC1 vs PC3 plot. The masking effect of NK cells on the distinction between these populations has been alleviated.

Two CD8 Populations in PC2 vs PC3 After CD4 Removal: The absence of CD4 cells in the reference dataset has allowed for the identification of subtle heterogeneity within CD8 in the PC2 vs PC3 plot. This finding underscores how the presence of CD4 cells was impacting the representation of CD8 cell variability.

---
title: "Principal Component Analysis on Query Dataset Mapped Using Reference, with One Cell Type Removed"
date: "2023-06-21"
output:
  html_document:
    theme: spacelab
    highlight: default
    code_download: true
    toc: true
    toc_depth: 3
    toc_float: 
            collapsed: false
---
  
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Summary of Analysis

In this analysis, we investigate the impact of removing specific cell types from the reference dataset and mapping/querying them using the modified reference. The main steps include:

Cell Type Removal: CD4, CD8, CD14, and NK cell types are systematically removed from the reference dataset.

Mapping and Querying: Utilizing the modified reference, we map the query dataset.

Principal Component Regression: Regression of principal components (PCs) against predicted cell types is performed. We visualize PC scatter plots and box plots.

Distance Calculation: Calculation of Euclidean distances between the cell types in the reference and the corresponding cell types in the query.

Inter-Cell Type Distances: Computation of distances between the cell types removed from the reference and the cell types to which they were assigned in the query.

# Introduction

Welcome to this document, where we delve into the results of a regression analysis involving Principal Components (PCs) and cell labels. Our exploration encompasses two distinct scenarios:

Regression of PCs against Observed Cell Labels in Query Dataset: This scenario involves the regression of Principal Components against the actual cell labels within the query dataset.

Regression of PCs against Predicted Cell Labels in Query Dataset: In this scenario, we perform regression of Principal Components against predicted cell labels in the query dataset. These predicted labels are derived from systematically excluding one cell at a time from the reference dataset. The modified reference is then employed to map the query dataset.

The analysis hinges on a Single-Cell RNA sequencing (scRNA-seq) 10x Genomics PBMC dataset. This dataset is divided into two subsets: the reference set (10x Single-Cell RNAseq PBMCs - Reference) and the query dataset (10x Single-Cell RNAseq PBMCs - Withheld Test Set). The dataset can be accessed from the following link: https://figshare.com/projects/Curated_10X_Data/137892


```{r Libraries, warning=FALSE,message=FALSE}
library(SingleR)
library(scater)
library(Polychrome)
library(plotly)
library(ggplot2)
```

# Data Processing

In this section, we load the reference and withheld datasets, log-transform the data, and explore the impact of removing different cell types from the reference dataset on prediction accuracy.

```{r Data processing, echo=F, warning=F}
# Load reference and withheld datasets
load("~/Downloads/10x_pbmcs_reference_sce.rda")
load("~/Downloads/10x_pbmcs_withheld_sce.rda")

# Log transform datasets
reference <- logNormCounts(pbmcs_reference_sce)
query <- logNormCounts(pbmcs_withheld_sce)
```

# PCA of reference

```{r PCA reference, echo=F, warning=F}
## Run PCA
reference <- runPCA(reference)
df_ref <- data.frame(reducedDim(reference, "PCA"), CellType = reference$label)

# Generate colors
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
cols <- createPalette(length(unique(df_ref$CellType)), seeds)
names(cols) <- unique(df_ref$CellType)

# PC1 vs PC2 scatter plot
p <- ggplot(df_ref, aes(x = PC1, y = PC2, color = CellType)) +
   geom_point(size = 0.5) +
     scale_color_manual(values = cols, guide = guide_legend(override.aes = list(size = 2))) +
     labs(x = "PC1", y = "PC2", title = "PC1 vs PC2 Scatter Plot") +
     theme_bw()
ggplotly(p)
```

# Analysis - Removing Cell Types

Here, we analyze the effect of removing individual cell types from the reference dataset on prediction accuracy using the SingleR package. We have a reference dataset containing CD4, CD8, CD14, and NK cell types. We remove CD4 cells from the reference and map query using this reference. Similarly, this process is performed for CD8, CD14, and NK cells individually, each time removing one cell type and mapping query with this modified reference.

```{r Analysis - Removing Cell Types, echo=T, warning=F}

# Removal of one cell type at a time from reference
tab <- lapply(unique(reference$label), function(i) {
  indx <- which(reference$label == i)
  refn <- reference[, -indx]
  annotations <- SingleR(query, ref = refn, labels = refn$label)
  table_result <- as.data.frame(table(annotations$labels, query$label))
  return(list(annotations = annotations, table_result = table_result))
})

df <- mapply(function(tbl_result, label) {
  indx <- which(tbl_result$Var2 == label)
  tbl_result[indx, ]
}, lapply(tab, `[[`, "table_result"), unique(reference$label), SIMPLIFY = FALSE)

df = do.call(rbind, df)
df <- subset(df, Freq > 0)

# Concordance plot
df$Var2 <- factor(df$Var2, levels = c("CD8", "NK", "CD4", "CD14"))
ggplot(df, aes(x = Var1, y = Var2, size = Freq)) + 
  geom_point() + 
  xlab("Predicted labels") + 
  ylab("Removed labels") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```


The scatter plot above displays removed cell types from reference on y axis and predicted cell type on x 
axis. For instance, if from reference dataset CD8 cells are removed, It will be predicted as CD4 cells in a query dataset.

# PCA and Regression analysis

In this section, we apply PCA to the query dataset and perform regression analysis to understand the relationship between principal components and observed cell labels from query dataset.

```{r Regression analysis, echo=T, warning=F}
query <- runPCA(query)

# Function to regress PCs against cell labels
regress_PCs <- function(data) {
  
  # Calculate R-squared values
  lm_models <- lapply(1:10, function(i) {
    lm_model <- lm(paste0("PC", i, " ~ CellTypes"), data = data)
    return(lm_model)
  })
  
  # Calculate R-squared values
  rsquared <- sapply(lm_models, function(model) {
    rsq <- summary(model)$r.squared
    return(rsq)
  })
  
  rsquared_df <- data.frame(PC = 1:10, R2 = rsquared)
  return(rsquared_df)
}

# Function to create scatter plot and convert to plotly
scatter_to_plotly <- function(data, x_col, y_col, color_col, title) {
  p <- ggplot(data, aes_string(x = x_col, y = y_col, color = color_col)) +
    geom_point(size = 0.5) +
    scale_color_manual(values = color_palette, guide = guide_legend(override.aes = list(size = 2))) +
    labs(x = x_col, y = y_col, title = title) +
    theme_bw()
  
  return(ggplotly(p))
}

calculatePairwiseDistancesAndPlotDensity <- function(query_data, ref_data, query_cell_type_col, ref_cell_type_col, cell_type_query, cell_type_reference,distance_metric, correlation_method = "pearson") {
  # Subset query and reference data to the specified cell type
  query_data_subset <- query_data[, !is.na(query_data[[query_cell_type_col]]) & query_data[[query_cell_type_col]] == cell_type_query]
  ref_data_subset <- ref_data[, !is.na(ref_data[[ref_cell_type_col]]) & ref_data[[ref_cell_type_col]] == cell_type_reference]
  
  # Convert to matrix
  query_mat <- t(as.matrix(assay(query_data_subset, "logcounts")))
  ref_mat <- t(as.matrix(assay(ref_data_subset, "logcounts")))
  
  # Combine query and reference matrices
  combined_mat <- rbind(query_mat, ref_mat)
  
  # Calculate pairwise distances or correlations for all comparisons
  if (distance_metric == "correlation") {
    if (correlation_method == "pearson") {
      dist_matrix <- cor(t(combined_mat), method = "pearson")
    } else if (correlation_method == "spearman") {
      dist_matrix <- cor(t(combined_mat), method = "spearman")
    } else {
      stop("Invalid correlation method. Available options: 'pearson', 'spearman'")
    }
  } else {
    dist_matrix <- dist(combined_mat, method = distance_metric)
  }
  
  # Convert dist_matrix to a square matrix
  dist_matrix <- as.matrix(dist_matrix)
  
  # Extract the distances or correlations for the different pairwise comparisons
  num_query_cells <- nrow(query_mat)
  num_ref_cells <- nrow(ref_mat)
  dist_query_query <- dist_matrix[1:num_query_cells, 1:num_query_cells]
  dist_ref_ref <- dist_matrix[(num_query_cells+1):(num_query_cells+num_ref_cells), (num_query_cells+1):(num_query_cells+num_ref_cells)]
  dist_query_ref <- dist_matrix[1:num_query_cells, (num_query_cells+1):(num_query_cells+num_ref_cells)]
  
  # Create data frame for plotting
  dist_df <- data.frame(
    Comparison = c(rep("Query vs Query", length(dist_query_query)),
                   rep("Reference vs Reference", length(dist_ref_ref)),
                   rep("Query vs Reference", length(dist_query_ref))),
    Distance = c(as.vector(dist_query_query),
                 as.vector(dist_ref_ref),
                 as.vector(dist_query_ref))
  )
  
  # Plot density plots
  ggplot(dist_df, aes(x = Distance, color = Comparison)) +
    geom_density() +
    labs(x = ifelse(distance_metric == "correlation", paste(correlation_method, "correlation"), "Distance"), y = "Density", title = "Pairwise Distance Analysis and Density Visualization") +
    theme_bw()
}
```

# Visualizing Regression Results

Note: Here original/observed query data cell labels are used.

We visualize the outcome of the regression analysis by creating line plots that depict the R-squared values for each principal component. 

PC1: The high R-squared value indicates that a significant portion of the variability in PC1 is explained by the cell labels. This suggests that PC1 captures meaningful differences between the cell types.

PC2: Captures a substantial amount of variation related to cell labels. This component contributes significantly to distinguishing between different cell types.

PC3: The R-squared value of PC3 indicates that it still captures a moderate amount of variation explained by cell labels, although it is less influential than PC1 and PC2.

PC4 to PC10: The R-squared values for PC4 to PC10 are relatively lower, ranging from 0.0366 to 0.0073. These components capture much less of the variation explained by the cell labels. Their low R-squared values might suggest that these principal components might be influenced by other factors not accounted for by the provided cell labels

```{r R2 analysis, echo=T, warning=F}
# Data preparation
query_observed <- data.frame(reducedDim(query, "PCA"), CellTypes = query$label)
rsquared_df <- regress_PCs(query_observed)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)
```

# Scatter plots between different PCs for query dataset (Observed query dataset)

```{r scatter plots query observed, echo=T, warning=F}

# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query_observed$CellType)), seeds)
names(color_palette) <- unique(query_observed$CellType)
```

## PC1 vs PC2 scatter plot

The scatter plot of PC1 vs PC2 indicates that there are separate clusters for NK cells and CD14 cells, which means that these two cell types are well-separated from the others in this two-dimensional space defined by PC1 and PC2. This separation suggests that PC1 and PC2 capture distinct variations that allow for distinguishing between NK cells and CD14 cells from other cell types.

```{r scatter plots query observed PC1 vs PC2, echo=T, warning=F}
scatter_to_plotly(query_observed, "PC1", "PC2", "CellTypes", "PC1 vs PC2")
```

## PC1 vs PC3 scatter plot

The variation captured by PC3 contributes to the separation of CD14 cells from other cell types when looking at the plot with PC3.

```{r scatter plots query observed PC1 vs PC3, echo=T}
scatter_to_plotly(query_observed, "PC3", "PC1", "CellTypes", "PC1 vs PC3")
```

## PC1 vs PC4 scatter plot

Similar to above results, the variation captured by PC4 contributes to the separation of CD14 cells from other cell types.

```{r scatter plots query observed PC1 vs PC4, echo=T}
scatter_to_plotly(query_observed, "PC4", "PC1", "CellTypes", "PC1 vs PC4")
```

## PC4 vs PC5 scatter plot

Plot of PC4 vs PC5 shows a mixed pattern with no clear separation or clustering of cell types, suggesting that the variation captured by these PCs may not have a strong discriminatory effect on the different cell types in the dataset. 

```{r scatter plots query observed PC4 vs PC5, echo=T}
scatter_to_plotly(query_observed, "PC4", "PC5", "CellTypes", "PC4 vs PC5")
```

## Boxplots of PC1 vs Cell types

CD14 cells might exhibit distinct molecular characteristics that set them apart from the other three cell types.

```{r PC1 vs cluster boxplots query observed, echo=T}
ggplot(query_observed, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

## Boxplots of PC2 vs Cell types

These results indicate that NK cells might exhibit distinct gene expression patterns or molecular characteristics captured by PC2.

```{r PC2 vs cluster boxplots query observed, echo=T}
ggplot(query_observed, aes(x = CellTypes, y = PC2, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

## Boxplots of PC3 vs Cell types

```{r PC3 vs cluster boxplots query observed, echo=T}
ggplot(query_observed, aes(x = CellTypes, y = PC3, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC3") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

# Query dataset when CD4 cell type is removed from reference dataset

In this section, we focus on the query dataset that was generated after removing CD4 cell types from the reference dataset. We perform analysis on this modified dataset to explore how the Principal Components (PCs) relate to the predicted cell labels.
 
PC1 captures a high amount of variance, suggesting that it explains a substantial portion of the variability in the data.

PC2 also captures a significant amount of variance, indicating its importance in explaining variability.

The R-squared values for PC3 to PC10 are much lower, indicating that these components capture relatively less variance in the data.

```{r R2 query (CD4 remvoved from reference), echo=T}
# Data preparation
query1 <- data.frame(reducedDim(query, "PCA"), CellTypes = tab[[1]]$annotations$labels)
rsquared_df <- regress_PCs(query1)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components (CD4 cells removed from reference dataset)") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)
```

# Scatter plots between different PCs for query dataset (from reference CD4 cells removed)

We create scatter plots to visualize the PCA results for the query dataset with CD4 cell types removed from the reference.

```{r Sactter plots query (CD4 remvoved from reference), echo=T}
# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query1$CellType)), seeds)
names(color_palette) <- unique(query1$CellType)
```

## PC1 vs PC2 scatter plot

The scatter plot of PC1 vs PC2 for the query dataset with CD4 cell types removed from the reference demonstrates a distinct separation of clusters. This separation suggests that the variations captured by PC1 and PC2 are effective at distinguishing different cell types in the query dataset.

```{r Sactter plots query (CD4 remvoved from reference) PC1 vs PC2, echo=T}
scatter_to_plotly(query1, "PC1", "PC2", "CellTypes", "PC1 vs PC2")
```

## PC1 vs PC3 scatter plot

The separation of CD14 cells from the others along the PC3 axis is notable, indicating that the variation captured by PC3 contributes significantly to distinguishing CD14 cells from the rest. Additionally, the proximity of CD8 and NK cells along the PC3 axis suggests that these cell types may share some common gene expression patterns or molecular characteristics captured by PC3. 

Also, concordance plot above revealed that when CD8 cells were removed from the reference dataset, a small subset of cells in the query dataset were being assigned as NK cells.

```{r Sactter plots query (CD4 remvoved from reference) PC1 vs PC3, echo=T}
scatter_to_plotly(query1, "PC3", "PC1", "CellTypes", "PC1 vs PC3")
```

## PC2 vs PC3 scatter plot

Separation of NK cells from other cell types seems to be influenced by both PC2 and PC3,

```{r Sactter plots query (CD4 remvoved from reference) PC2 vs PC3, echo=T}
scatter_to_plotly(query1, "PC3", "PC2", "CellTypes", "PC3 vs PC2")
```

## Boxplots of PC1 vs Cell types

```{r PC1 vs cluster boxplots query1, echo=T}
ggplot(query1, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

## Boxplots of PC2 vs Cell types

```{r PC2 vs cluster boxplots query1, echo=T}
ggplot(query1, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

## Boxplots of PC3 vs Cell types

```{r P3 vs cluster boxplots query1, echo=T}
ggplot(query1, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC3") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

# Query dataset when CD14 cell type is removed from reference dataset

In this section, we focus on the query dataset that was generated after removing CD14 cell types from the reference dataset. We perform analysis on this modified dataset to explore how the Principal Components (PCs) derived from PCA relate to the predicted cell labels.

PC1 explains approximately 11.06% of the total variance in the data.
This value suggests that PC1 captures a moderate amount of variation in the dataset. While not extremely high, it still contributes significantly to the understanding of the differences among the data points.

PC2 explains approximately 86.10% of the total variance in the data.
This high R-squared value indicates that PC2 captures a substantial amount of variation in the dataset. PC2 is a dominant component in explaining the differences among the data points.

PC3 explains approximately 44.17% of the total variance in the data.
This value indicates that PC3 captures a moderate to high amount of variation in the dataset.

PC4 to PC10 each explain less than 10% of the total variance in the data.
These components contribute relatively little to the overall variance and might capture noise or less significant patterns in the dataset.

```{r R2 query (CD14 remvoved from reference), echo=T}

# Data preparation
query2 <- data.frame(reducedDim(query, "PCA"), CellTypes = tab[[2]]$annotations$labels)
rsquared_df <- regress_PCs(query2)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components (CD14 cells removed from reference dataset)") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)
```

# Scatter plots between different PCs for query dataset (from reference CD14 cells removed)

We create scatter plots to visualize the PCA results for the query dataset with CD14 cell types removed from the reference.

```{r Sactter plots query (CD14 remvoved from reference), echo=T}
# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query2$CellType)), seeds)
names(color_palette) <- unique(query2$CellType)
```

## PC1 vs PC2 scatter plot

In the scatter plot of PC1 vs PC2 for the query dataset (with reference CD14 cells removed), the following observations can be made:

Two Populations/Clusters of CD8 Cells: There are two distinct clusters or populations of CD8 cells visible on the plot. One of these clusters overlaps with CD4 cells, indicating some degree of similarity in their gene expression patterns.

NK cells are well-separated from the other cell types, suggesting that the variation captured by PC1 and PC2 is effective in distinguishing NK cells from the rest.

```{r Sactter plots query (CD14 remvoved from reference) PC1 vs PC2, echo=T}
scatter_to_plotly(query2, "PC1", "PC2", "CellTypes", "PC1 vs PC2")
```

## PC1 vs PC4 scatter plot

```{r Sactter plots query (CD14 remvoved from reference) PC1 vs PC4, echo=T}
scatter_to_plotly(query2, "PC4", "PC1", "CellTypes", "PC1 vs PC4")
```

## PC2 vs PC4 scatter plot

```{r Sactter plots query (CD14 remvoved from reference) PC2 vs PC4, echo=T}
scatter_to_plotly(query2, "PC4", "PC2", "CellTypes", "PC4 vs PC2")
```

## Boxplots of PC1 vs Cell types

The boxplots of PC1 values for different cell types reveal that the CD8 population exhibits a larger spread or variability along the PC1 axis compared to the CD4 and NK populations. This suggests that there might be greater heterogeneity or diversity within the CD8 cell population in terms of the gene expression patterns captured by PC1. On the other hand, the CD4 and NK populations seem to be more tightly clustered or have less variation along PC1.

```{r PC1 vs cluster boxplots query2, echo=T}
ggplot(query2, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

## Boxplots of PC2 vs Cell types

PC2 contributes to the separation of NK cells from the other two cell types.

```{r PC2 vs cluster boxplots query 2, echo=T}
ggplot(query2, aes(x = CellTypes, y = PC2, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

# Query dataset when CD8 cell type is removed from reference dataset

In this section, we focus on the query dataset that was generated after removing CD8 cell types from the reference dataset. We perform analysis on this modified dataset to explore how the Principal Components (PCs) derived from PCA relate to the predicted cell labels.

From the plot, we can see that PC1 and PC2 capture the majority of the variance in the data, with PC1 explaining the most variance. The subsequent PCs (PC3 to PC10) capture less variance and may represent smaller or more subtle patterns in the data.

```{r R2 query (CD8 remvoved from reference), echo=T}

# Data preparation
query3 <- data.frame(reducedDim(query, "PCA"), CellTypes = tab[[3]]$annotations$labels)
rsquared_df <- regress_PCs(query3)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components (CD8 cells removed from reference dataset)") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)
```

# Scatter plots between different PCs for query dataset (from reference CD8 cells removed)

We created scatter plots to visualize the PCA results for the query dataset with CD8 cell types removed from the reference.

```{r Sactter plots query (CD8 remvoved from reference), echo=T}
# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query3$CellType)), seeds)
names(color_palette) <- unique(query3$CellType)
```

## PC1 vs PC2 scatter plot

The scatter plot of PC1 against PC2 for the query dataset, with reference CD8 cells removed, shows clear separation between different cell types. This separation could indicate that the variability captured by PC1 and PC2 is significant enough to differentiate between the remaining cell types in your dataset. 

```{r Sactter plots query (CD8 remvoved from reference) PC1 vs PC2, echo=T}
scatter_to_plotly(query3, "PC1", "PC2", "CellTypes", "PC1 vs PC2")
```

## PC1 vs PC3 scatter plot

CD14 cells are separated from the other cell types along the PC3 axis. This indicates that the variation captured by PC3 contributes significantly to the differences between CD14 cells and the rest. 

```{r Sactter plots query (CD8 remvoved from reference) PC1 vs PC3, echo=T}
scatter_to_plotly(query3, "PC3", "PC1", "CellTypes", "PC1 vs PC3")
```

## PC2 vs PC3 scatter plot

```{r Sactter plots query (CD8 remvoved from reference) PC2 vs PC3, echo=T}
scatter_to_plotly(query3, "PC3", "PC2", "CellTypes", "PC2 vs PC3")
```

## Boxplots of PC1 vs Cell types

```{r PC1 vs cluster boxplots query3, echo =T}
ggplot(query3, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

## Boxplots of PC2 vs Cell types

```{r PC2 vs cluster boxplots query3, echo=T}
ggplot(query3, aes(x = CellTypes, y = PC2, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

## Boxplots of PC3 vs Cell types

```{r PC3 vs cluster boxplots query3, echo=T}
ggplot(query3, aes(x = CellTypes, y = PC3, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC3") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

# Query dataset when NK cell type is removed from reference dataset

In this section, we focus on the query dataset that was generated after removing NK cell types from the reference dataset. We perform analysis on this modified dataset to explore how the Principal Components (PCs) derived from PCA relate to the predicted cell labels.

PC1 has an R-squared value of 0.949, indicating that it captures a substantial amount of variation in the dataset. It suggests that PC1 explains a large portion of the total variability.

PC2 has an R-squared value of 0.185, which is significantly lower than PC1. This means that PC2 explains a smaller portion of the total variability compared to PC1.

PC3 has an R-squared value of 0.396, indicating that it captures a moderate amount of variation. It's higher than PC2 but not as high as PC1.

PCs 4 to 10 have much lower R-squared values ranging from 0.024 to 0.001, suggesting that they explain relatively small amounts of variation in the dataset.

```{r R2 query (NK remvoved from reference), echo=T}
# Data preparation
query4 <- data.frame(reducedDim(query, "PCA"), CellTypes = tab[[4]]$annotations$labels)
rsquared_df <- regress_PCs(query4)

# Create a line plot using ggplot
ggplot(rsquared_df, aes(x = PC, y = R2)) +
  geom_line() +
  geom_point(shape = 16) +
  labs(x = "Principal Component (PC)",
       y = "R-squared",
       title = "R-squared for Principal Components (NK cells removed from reference dataset)") +
  ylim(0, 1) +
  theme_bw() + 
  scale_x_continuous(breaks = 1:10, labels = 1:10)
```

# Scatter plots between different PCs for query dataset (from reference NK cells removed)

We create scatter plots to visualize the PCA results for the query dataset with NK cell types removed from the reference.

```{r Sactter plots query (NK remvoved from reference), echo=T}

# Define color palette
seeds <- c("#FF0000FF", "#CCFF00FF", "#00FF66FF", "#0066FFFF", "#CC00FFFF")
color_palette <- createPalette(length(unique(query4$CellType)), seeds)
names(color_palette) <- unique(query4$CellType)
```

## PC1 vs PC2 scatter plot

CD14 cells are well separated from the other cell types along the PC1 axis.

There are two distinct populations of CD8 cells: one population is somewhat separate from the other, and the latter population has some overlap with CD4 cells.

```{r Sactter plots query (NK remvoved from reference) PC1 vs PC2, echo=T}
scatter_to_plotly(query4, "PC1", "PC2", "CellTypes", "PC1 vs PC2")
```

## PC1 vs PC3 scatter plot

CD14 cells continue to show a clear separation from the other cell types, indicating distinct differences in gene expression patterns or other molecular characteristics.

```{r Sactter plots query (NK remvoved from reference) PC1 vs PC3, echo=T}
scatter_to_plotly(query4, "PC3", "PC1", "CellTypes", "PC1 vs PC3")
```

## PC1 vs PC4 scatter plot

```{r Sactter plots query (NK remvoved from reference) PC1 vs PC4, echo=T}
scatter_to_plotly(query4, "PC4", "PC1", "CellTypes", "PC1 vs PC4")
```

## PC2 vs PC3 scatter plot

```{r Sactter plots query (NK remvoved from reference) PC2 vs PC3, echo=T}
scatter_to_plotly(query4, "PC3", "PC2", "CellTypes", "PC3 vs PC2")
```

## PC3 vs PC4 scatter plot

```{r Sactter plots query (NK remvoved from reference) PC3 vs PC4, echo=T}
scatter_to_plotly(query4, "PC3", "PC4", "CellTypes", "PC3 vs PC4")
```

## Boxplots of PC1 vs Cell types

PC1 captures significant variation that distinguishes CD14 cells from CD4 and CD8 cells.

```{r PC1 vs cluster boxplots query4, echo=T}
ggplot(query4, aes(x = CellTypes, y = PC1, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC1") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

## Boxplots of PC2 vs Cell types

The following analysis reveals:

CD8 Cell Population Spread: The noticeable spread in the CD8 cell population along PC2 indicates that there is significant variability in the gene expression or molecular characteristics of CD8 cells that is captured by this principal component. This spread might suggest that there are subgroups or subtypes within the CD8 cell population that exhibit distinct patterns along PC2.

Two CD8 Cell Populations: The presence of two distinct CD8 cell populations in the PC1 vs PC2 scatter plot aligns with the spread seen in the boxplots of PC2 vs Cell types. This suggests that the variation captured by PC2 contributes to the differentiation between these two CD8 cell populations. The spread seen in the boxplots might be attributed to the differences between these populations in terms of the captured variation along PC2.

```{r PC2 vs cluster boxplots query4, echo=T}
ggplot(query4, aes(x = CellTypes, y = PC2, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC2") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

## Boxplots of PC3 vs Cell types

```{r PC3 vs cluster boxplots query4, echo=T}
ggplot(query4, aes(x = CellTypes, y = PC3, fill = CellTypes)) +
    geom_boxplot() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    guides(fill = "none") +
    xlab("CellTypes") +
    ylab("PC3") + geom_hline(yintercept = 0, linetype = "dashed", color = "red")
```

# Distance analysis between distinct cell types of reference and query dataset

Note: Distances are being calculated for cell types in original/observed reference and query dataset.

Identified marker genes using trainSingleR() function from SingleR package and used these genes to compute 
distances between cells in reference and query dataset.

```{r SingleR genes, echo=F, warning=F}
## Train SingleR classifier to get list of genes
train <- trainSingleR(
  reference,
  reference$label)

markers <- train$markers$unique

# Extract labels from each element in the 'tab' list using lapply
label_lists <- lapply(tab, function(x) x$annotations$labels)
annotations <- do.call(data.frame, label_lists)
colnames(annotations) <- c("labels_CD4_removed", "labels_CD14_removed", "labels_CD8_removed", "labels_NK_removed")

# Add annotations to query dataset
colData(query)[,c("labels_CD4_removed", "labels_CD14_removed", "labels_CD8_removed", "labels_NK_removed")] <- annotations

# Subset query and reference dataset based on markers
query_subset <- query[markers, ]
reference_subset <- reference[markers, ]
```

## Distance between CD4 cells in reference and query

```{r CD4 genes, echo=T}
calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "label", "label", "CD4", "CD4", "euclidean")
```

## Distance between CD14 cells in reference and query

```{r CD14 genes, echo=T}
calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "label", "label", "CD14", "CD14", "euclidean")
```

## Distance between CD8 cells in reference and query

```{r CD8 genes, echo=T}
calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "label", "label", "CD8", "CD8", "euclidean")
```

## Distance between NK cells in reference and query

```{r NK genes, echo=T}
calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "label", "label", "NK", "NK", "euclidean")
```

# Distance analysis between cell types removed from reference and assigned cell type in query dataset

In this section, we calculate Euclidean distances between the removed cell types in the reference and their assigned cell types in the query.

```{r data processing for distance analsyis, echo=T}
# Extract labels from each element in the 'tab' list using lapply
label_lists <- lapply(tab, function(x) x$annotations$labels)
annotations <- do.call(data.frame, label_lists)
colnames(annotations) <- c("labels_CD4_removed", "labels_CD14_removed", "labels_CD8_removed", "labels_NK_removed")
```

## Distance computation between CD4 cells in query and CD8 cells in reference

When, CD8 cells were removed from the reference and when this reference was used to annotate the query dataset, these cells were assigned as CD4 in query.

```{r CD8 vs CD4 genes, echo=T}
calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "labels_CD4_removed", "label", "CD8", "CD4", "euclidean")
```

## Distance computation between CD8 cells in query and CD14 cells in reference

When, CD14 cells were removed from the reference and when this reference was used to annotate the query dataset, these cells were assigned as CD8 in query.

```{r CD14 vs CD4 genes, echo=T}
calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "labels_CD14_removed", "label", "CD8", "CD14", "euclidean")
```

## Distance computation between CD4 cells in query and CD8 cells in reference

When, CD8 cells were removed from the reference and when this reference was used to annotate the query dataset, these cells were assigned as CD4 in query.

```{r CD4 vs CD8 genes, echo=T}
calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "labels_CD8_removed", "label", "CD4", "CD8", "euclidean")
```

## Distance computation between CD8 cells in query and NK cells in reference

When, NK cells were removed from the reference and when this reference was used to annotate the query dataset, these cells were assigned as CD8 in query.

```{r CD8 vs NK genes, echo=T}
calculatePairwiseDistancesAndPlotDensity(query_subset, reference_subset, "labels_NK_removed", "label", "CD8", "NK", "euclidean")
```

# Principal Component Analysis (PCA) Key Observations:

CD4 and CD8 Overlap in PC1 vs PC2 in the Query:
The overlap observed between CD4 and CD8 cells in the PC1 vs PC2 plot of the query dataset suggests the presence of shared gene expression patterns between these two cell types. This shared pattern might lead to their proximity in the PCA space.

CD14, CD8, and NK Separation in PC1 vs PC2 After CD4 Removal:
Upon removing CD4 cells from the reference dataset, there is an improved separation of CD14, CD8, and NK cells in the PC1 vs PC2 plot. This enhancement in separation could be attributed to the reduction of noise caused by CD4 cells, enabling a clearer distinction between these cell populations.

Two CD8 Populations in PC2 vs PC3 After CD4 Removal:
The absence of CD4 cells from the reference dataset has uncovered previously masked heterogeneity within CD8 cells. This is evident from the emergence of two distinct CD8 subpopulations in the PC2 vs PC3 plot, indicating that CD4 cells were influencing the grouping of CD8 cells.

Two CD8 Populations in PC1 vs PC2 After CD14 Removal:
Following the removal of CD14, there are notable changes in the variance captured by PC1 and PC2. This unmasking of variability has led to the identification of two distinct CD8 subpopulations in the PC1 vs PC2 plot, suggesting that CD14 cells were masking this heterogeneity.

Two CD4 Populations in PC2 vs PC3 After CD8 Removal:
The absence of CD8 T cells in the reference dataset has revealed hidden variability within CD4 T cells. This is evident from the appearance of two distinct CD4 subpopulations in the PC2 vs PC3 plot, highlighting that the presence of CD8 cells was impacting the representation of CD4 cells.

Two CD8 Populations in PC1 vs PC2 After NK Removal:
Removing NK cells from the reference dataset has had an influence on the positioning of CD8 T cells in the PCA space. This effect is seen in the emergence of two CD8 cell populations in the PC1 vs PC2 plot, indicating that the presence of NK cells was affecting the distribution of CD8 cells.

Distinct CD4, CD14, and CD8 Populations in PC1 vs PC3 After NK Removal:
The removal of NK cells from the reference dataset has brought about clearer separation between CD4 T cells, CD14, and CD8 cells in the PC1 vs PC3 plot. The masking effect of NK cells on the distinction between these populations has been alleviated.

Two CD8 Populations in PC2 vs PC3 After CD4 Removal:
The absence of CD4 cells in the reference dataset has allowed for the identification of subtle heterogeneity within CD8 in the PC2 vs PC3 plot. This finding underscores how the presence of CD4 cells was impacting the representation of CD8 cell variability.

